Method and system for on-demand scratch register renaming

ABSTRACT

A method and processor for performing on-demand scratch register reallocation by dynamically adjusting the number of scratch registers from within the pool of rename registers includes initially allocating from a set of physical registers one or more architected registers and a pool of one or more rename registers and allocating from the pool of rename registers an initial number of scratch registers for storing microcode operands. In response to detecting that a fetched instruction requires an additional scratch register beyond the initial number, a selected physical register is reallocated from among the pool of rename registers as the additional scratch register, and a flag is set to indicate the rename register is allocated as the additional scratch register. In response to determining that the additional scratch register is no longer needed, the additional scratch register is deallocated and the flag is reset, such that the selected physical register returns to the pool of rename registers.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 11/390,785, filed on Mar. 28, 2006, and entitled “Method and System for On-Demand Scratch Register Renaming,” which is assigned to the assignee of the present invention and incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates in general to data processing and, in particular, to resource allocation in a processor. Still more particularly, the present invention relates to a method and system for on-demand scratch register renaming in a processor.

2. Description of the Related Art

A typical superscalar microprocessor is a highly complex digital integrated circuit including, for example, one or more levels of cache memory for storing instructions and data, a number of execution units for executing instructions, instruction sequencing logic for retrieving instructions from memory and routing the instructions to the various execution units, and registers for storing operands and result data. Interspersed within and between these components are various queues, buffers and latches for temporarily buffering instructions, data and control information. As will be appreciated, at any one time the typical processor described above contains an enormous amount of state information, which can be defined as the aggregate of the instructions, data and control information present within the processor.

Many microprocessors implement microcode to break complex instructions into smaller operations (a.k.a. internal ops, or iops). To transfer data between iops, the prior art solution defines a small fixed number of General Purpose Registers (GPRs) as scratch registers (a.k.a. extended GPRs, or eGPRs) for use only by microcode. Scratch registers are storage locations dedicated to the storage of operands of microcode instructions. In order to have a compact instruction encoding, most processor instruction sets have a small set of special locations which can be directly named. These registers capable of being directly named are called rename registers, and are storage locations for a future state of an architected register. Register renaming refers to a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations. One limiting performance factor in an out-of-order microprocessor design is the availability of GPR rename registers. Under the prior art, the total number of rename registers available is equal to the total number of physical registers less the number of logical registers defined for each thread, because the latest set of committed logical registers must be preserved for the possibility that speculative out-of-order instructions are flushed.

Speaking generically of the prior art, the relationship between available rename registers and physical registers is: Nrename = Nphysical − Nthreads * Nlogical. However, for the known solution for a microprocessor performing out-of-order instructions with microcode (and thus scratch registers), the relationship becomes Nrename = Nphysical − Nthreads * (Nlogical + Nscratch). The result of the prior art solution is that, for a microprocessor with multiple threads, the number of renames available for computation can become significantly reduced, due in large measure to the prior-art solution for scratch register handling.
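As a rough illustration of the two relationships above, the following Python sketch computes the number of rename registers left over for a hypothetical configuration. The function name and the register counts used (120 physical registers, 2 threads, 32 logical registers, 4 fixed scratch registers) are assumptions chosen purely for illustration and do not come from this disclosure.

```python
def rename_registers(n_physical, n_threads, n_logical, n_scratch=0):
    """Rename registers left after reserving logical (and fixed scratch) registers per thread."""
    reserved = n_threads * (n_logical + n_scratch)
    if reserved > n_physical:
        raise ValueError("not enough physical registers for the reserved state")
    return n_physical - reserved


# Hypothetical configuration, for illustration only.
without_scratch = rename_registers(120, 2, 32)            # Nscratch = 0
with_scratch = rename_registers(120, 2, 32, n_scratch=4)  # fixed scratch registers per thread
print(without_scratch, with_scratch)
```

Under these assumed numbers, reserving a fixed set of scratch registers per thread costs Nthreads * Nscratch rename registers whether or not microcode is currently using them.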

SUMMARY OF THE INVENTION

A method and processor for performing on-demand scratch register reallocation by dynamically adjusting the number of scratch registers from within the pool of rename registers includes initially allocating from a set of physical registers one or more architected registers and a pool of one or more rename registers and allocating from the pool of rename registers an initial number of scratch registers for storing microcode operands. In response to detecting that a fetched instruction requires an additional scratch register beyond the initial number, a selected physical register is reallocated from among the pool of rename registers as the additional scratch register, and a flag is set to indicate the rename register is allocated as the additional scratch register. In response to determining that the additional scratch register is no longer needed, the additional scratch register is deallocated and the flag is reset, such that the selected physical register returns to the pool of rename registers.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts an illustrative embodiment of a data processing system with which the method and system for on-demand scratch register renaming of the present invention may advantageously be utilized;

FIG. 2 illustrates an exemplary embodiment of a physical general purpose register file in accordance with a preferred embodiment of the present invention; and

FIG. 3 is a flowchart of a process for on-demand scratch register renaming in a processor in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

With reference now to the figures and in particular with reference to FIG. 1, there is depicted a high level block diagram of an illustrative embodiment of a data processing system for processing instructions and data in accordance with the present invention. In particular, data processing system 180 supports on-demand scratch register renaming.

Data processing system 180 includes multiple processors 10, each of which comprises a single integrated circuit superscalar processor, which, as discussed further below, includes various execution units, registers, buffers, memories, and other functional units that are all formed of digital integrated circuitry. As illustrated in FIG. 1, processor 10 may be coupled to other devices, such as a system memory 12 and a second processor 10, by an interconnect fabric 14 to form a larger data processing system such as a workstation computer system. Processor 10 also includes an on-chip multi-level cache hierarchy including a unified level two (L2) cache 16 and bifurcated level one (L1) instruction (I) and data (D) caches 18 and 20, respectively. As is well known to those skilled in the art, caches 16, 18 and 20 provide low latency access to cache lines corresponding to memory locations in system memory 12.

Instructions are fetched and ordered for processing by instruction sequencing logic 13 within processor 10. In the depicted embodiment, instruction sequencing logic 13 includes an instruction fetch address register (IFAR) 30 that contains an effective address (EA) indicating a cache line of instructions to be fetched from L1 I-cache 18 for processing. During each cycle, a new instruction fetch address may be loaded into IFAR 30 from one of three sources: branch prediction unit (BPU) 36, which provides speculative target path addresses resulting from the prediction of conditional branch instructions, global completion table (GCT) 38, which provides sequential path addresses, and branch execution unit (BEU) 92, which provides non-speculative addresses resulting from the resolution of predicted conditional branch instructions. If hit/miss logic 22 determines, after translation of the EA contained in IFAR 30 by effective-to-real address translation (ERAT) 32 and lookup of the real address (RA) in I-cache directory 34, that the cache line of instructions corresponding to the EA in IFAR 30 does not reside in L1 I-cache 18, then hit/miss logic 22 provides the RA to L2 cache 16 as a request address via I-cache request bus 24. Such request addresses may also be generated by prefetch logic within L2 cache 16 based upon recent access patterns. In response to a request address, L2 cache 16 outputs a cache line of instructions, which is loaded into prefetch buffer (PB) 28 and L1 I-cache 18 via I-cache reload bus 26, possibly after passing through optional predecode logic 144.

Once the cache line specified by the EA in IFAR 30 resides in L1 I-cache 18, L1 I-cache 18 outputs the cache line to both branch prediction unit (BPU) 36 and to instruction fetch buffer (IFB) 40. BPU 36 scans the cache line of instructions for branch instructions and predicts the outcome of conditional branch instructions, if any. Following a branch prediction, BPU 36 furnishes a speculative instruction fetch address to IFAR 30, as discussed above, and passes the prediction to branch instruction queue 64 so that the accuracy of the prediction can be determined when the conditional branch instruction is subsequently resolved by branch execution unit 92.

IFB 40 temporarily buffers the cache line of instructions received from L1 I-cache 18 until the cache line of instructions can be translated by instruction translation unit (ITU) 42. In the illustrated embodiment of processor 10, ITU 42 translates instructions from user instruction set architecture (UISA) instructions (e.g., PowerPC® instructions) into a possibly different number of internal ISA (IISA) instructions that are directly executable by the execution units of processor 10. Such translation may be performed, for example, by reference to microcode stored in a read-only memory (ROM) template. In at least some embodiments, the UISA-to-IISA translation results in a different number of IISA instructions than UISA instructions and/or IISA instructions of different lengths than corresponding UISA instructions. The resultant IISA instructions are then assigned by global completion table 38 to an instruction group, the members of which are permitted to be executed out-of-order with respect to one another. Global completion table 38 tracks each instruction group for which execution has yet to be completed by at least one associated EA, which is preferably the EA of the oldest instruction in the instruction group.

Following UISA-to-IISA instruction translation, instructions are dispatched in-order to one of latches 44, 46, 48 and 50 according to instruction type. That is, branch instructions and other condition register (CR) modifying instructions are dispatched to latch 44, fixed-point and load-store instructions are dispatched to either of latches 46 and 48, and floating-point instructions are dispatched to latch 50. Each instruction requiring a rename register for temporarily storing execution results is then assigned one or more registers within a register file by the appropriate one of CR mapper 52, link and count (LC) register mapper 54, exception register (XER) mapper 56, general-purpose register (GPR) mapper 58, and floating-point register (FPR) mapper 60.
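The routing of instructions to latches by type described above can be summarized in a small lookup table. The following Python sketch is illustrative only: the latch reference numerals come from the description, while the instruction-type names and the function name are hypothetical labels introduced here.

```python
# Latch reference numerals are from the description; the type names are hypothetical.
LATCHES_BY_TYPE = {
    "branch": (44,),
    "cr_modifying": (44,),
    "fixed_point": (46, 48),
    "load_store": (46, 48),
    "floating_point": (50,),
}


def candidate_latches(instruction_type):
    """Return the latch numerals an instruction of this type may be dispatched to."""
    try:
        return LATCHES_BY_TYPE[instruction_type]
    except KeyError:
        raise ValueError(f"unknown instruction type: {instruction_type!r}") from None
```

The sketch does not model how a choice is made between latches 46 and 48, since the description leaves that selection open.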

The dispatched instructions are then temporarily placed in an appropriate one of CR issue queue (CRIQ) 62, branch issue queue (BIQ) 64, fixed-point issue queues (FXIQs) 66 and 68, and floating-point issue queues (FPIQs) 70 and 72. From issue queues 62, 64, 66, 68, 70 and 72, instructions can be issued opportunistically (i.e., possibly out-of-order) to the execution units of processor 10 for execution. The instructions, however, are maintained in issue queues 62-72 until execution of the instructions is complete and the result data, if any, are written back, in case any of the instructions needs to be reissued.

As illustrated, the execution units of processor 10 include a CR unit (CRU) 90 for executing CR-modifying instructions, a branch execution unit (BEU) 92 for executing branch instructions, two fixed-point units (FXUs) 94 and 100 for executing fixed-point instructions, two load-store units (LSUs) 96 and 98 for executing load and store instructions, and two floating-point units (FPUs) 102 and 104 for executing floating-point instructions. Each of execution units 90-104 is preferably implemented as an execution pipeline having a number of pipeline stages.

During execution within one of execution units 90-104, an instruction receives operands, if any, from one or more architected and/or rename registers within a register file coupled to the execution unit. When executing CR-modifying or CR-dependent instructions, CRU 90 and BEU 92 access the CR register file 80, which in a preferred embodiment contains a CR and a number of CR rename registers that each comprise a number of distinct fields formed of one or more bits. Among these fields are LT, GT, and EQ fields that respectively indicate if a value (typically the result or operand of an instruction) is less than zero, greater than zero, or equal to zero. Link and count register (LCR) register file 82 contains a count register (CTR), a link register (LR) and rename registers of each, by which BEU 92 may also resolve conditional branches to obtain a path address. General-purpose register files (GPRs) 84 and 86, which are synchronized, duplicate register files, store fixed-point and integer values accessed and produced by FXUs 94 and 100 and LSUs 96 and 98. Floating-point register file (FPR) 88, which like GPRs 84 and 86 may also be implemented as duplicate sets of synchronized registers, contains floating-point values that result from the execution of floating-point instructions by FPUs 102 and 104 and floating-point load instructions by LSUs 96 and 98.

After an execution unit finishes execution of an instruction, the execution unit notifies GCT 38, which schedules completion of instructions in program order. To complete an instruction executed by one of CRU 90, FXUs 94 and 100 or FPUs 102 and 104, GCT 38 signals the appropriate mapper, which sets an indication to indicate that the register file register(s) assigned to the instruction now contains the architected state of the register. The instruction is then removed from the issue queue, and once all instructions within its instruction group have completed, is removed from GCT 38. Other types of instructions, however, are completed differently.

When BEU 92 resolves a conditional branch instruction and determines the path address of the execution path that should be taken, the path address is compared against the speculative path address predicted by BPU 36. If the path addresses match, no further processing is required. If, however, the calculated path address does not match the predicted path address, BEU 92 supplies the correct path address to IFAR 30. In either event, the branch instruction can then be removed from BIQ 64, and when all other instructions within the same instruction group have completed, from GCT 38.
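The comparison performed at branch resolution can be sketched as follows. This Python fragment is a simplified illustration, not the hardware logic of BEU 92; the function name and the use of None to represent "no redirect needed" are assumptions made for the example.

```python
def resolve_branch(predicted_path, calculated_path):
    """Return the address to load into the IFAR, or None if the prediction was correct."""
    if calculated_path == predicted_path:
        return None           # prediction matched: no further processing required
    return calculated_path    # misprediction: the correct path address is supplied


# Hypothetical addresses, for illustration only.
redirect_on_hit = resolve_branch(0x1000, 0x1000)
redirect_on_miss = resolve_branch(0x1000, 0x2000)
```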

Following execution of a load instruction (including a load-reserve instruction), the effective address computed by executing the load instruction is translated to a real address by a data ERAT (not illustrated) and then provided to L1 D-cache 20 as a request address. At this point, the load operation is removed from FXIQ 66 or 68 and placed in load data queue (LDQ) 114 until the indicated load is performed. If the request address misses in L1 D-cache 20, the request address is placed in load miss queue (LMQ) 116, from which the requested data is retrieved from L2 cache 16 or, failing that, from another processor 10 or from system memory 12.

Store instructions (including store-conditional instructions) are similarly completed utilizing a store queue (STQ) 110 into which effective addresses for stores are loaded following execution of the store instructions. From STQ 110, data can be stored into either or both of L1 D-cache 20 and L2 cache 16, following effective-to-real translation of the target address.

As will be appreciated by those skilled in the art from the foregoing description, the instruction handling circuitry of processor 10 can thus be considered an instruction pipeline in which instructions generally flow from cache memory to instruction sequencing logic 13, to issue queues 62-72, to execution units 90-104 and, for memory access instructions, to one of queues 110, 114, and 116, prior to completion and retirement from GCT 38.

To facilitate testing, processor 10 may optionally include conventional test circuitry 120 (e.g., an IEEE Std. 1149.1-compliant boundary scan interface) coupled between the internal logic illustrated in FIG. 1 and the input/output (I/O) pins of the chip package.

Referring now to FIG. 2, there is illustrated a more detailed description of a physical register file such as GPRs 84 and 86. Physical register file 200 is composed of physical registers 202-220. Physical registers 202-220 are allocated by GPR mapper 58 as architected registers 202-210, rename registers 212-220 and scratch registers 218-220. Each of physical registers 202-220 contains a flag bit 222-240 for indicating if the register is being used as a scratch register. When used as GPRs 84 and 86, physical registers 202-220 store fixed-point and integer values accessed and produced by FXUs 94 and 100 and LSUs 96 and 98. During execution within FXUs 94 and 100 and LSUs 96 and 98, an instruction receives operands, if any, from one or more architected registers 202-210, and a pool 242 of rename registers 212-220, some of which will be configurably designated as scratch registers 218-220 within a physical register file 200 coupled to execution units FXUs 94 and 100 and LSUs 96 and 98.
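One way to picture the register file of FIG. 2 is as a list of entries, each carrying its own scratch flag. The Python sketch below is a software model only, under stated assumptions: it treats the reference numerals as even numbers 202 through 220 (ten registers, matching the ten flag numerals 222-240), and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysicalRegister:
    numeral: int                # reference numeral from FIG. 2 (assumed even, 202..220)
    role: str                   # "architected" or "rename"
    scratch_flag: bool = False  # models the per-register flag bit
    value: int = 0


def build_register_file():
    """Build a model of physical register file 200 with no scratch registers allocated."""
    registers = []
    for numeral in range(202, 222, 2):
        role = "architected" if numeral <= 210 else "rename"
        registers.append(PhysicalRegister(numeral, role))
    return registers


register_file = build_register_file()
rename_pool = [r for r in register_file if r.role == "rename"]
```

Keeping the flag on the register entry itself, rather than in a separate table, mirrors the description's statement that each physical register contains a flag bit.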

Turning now to FIG. 3, a flowchart for a process for on-demand scratch register renaming in a processor in accordance with a preferred embodiment of the present invention is depicted. The process starts at step 300 and then moves to step 302, which depicts GPR mapper 58 or XER mapper 56 initially allocating from a set of physical registers 202-220 one or more architected registers 202-210 and a pool 242 of one or more rename registers 212-220, among which are an initial number of scratch registers 218-220 for storing microcode operands. In a preferred embodiment, that initial number of scratch registers 218-220 will be zero. The process then proceeds to step 304. Step 304 illustrates GPR mapper 58 waiting on a signal from instruction sequencing logic 13 indicating that an instruction has been fetched.

The process next moves to step 306, which depicts GPR mapper 58 detecting, on the basis of input from ITU 42, whether the fetched instruction is targeted to a scratch register, requiring an additional scratch register 218-220 beyond the initial allocation of scratch registers 218-220. If GPR mapper 58 detects, on the basis of input from ITU 42, that the fetched instruction requires an additional scratch register 218-220 beyond the initial allocation of scratch registers 218-220, then the process proceeds to step 310. Step 310 illustrates GPR mapper 58 reallocating a selected physical register 218 from among the pool 242 of rename registers 212-220 as an additional scratch register 218-220. The process next moves to step 312, which depicts GPR mapper 58 setting a flag 232-240 on the selected physical register 218 from among the pool 242 of rename registers 212-220 (e.g., by setting a single-bit latch) to indicate that the selected physical register 218 is used as an additional scratch register 218-220. The process then proceeds to step 308.

Returning to step 306, if GPR mapper 58 detects, on the basis of input from ITU 42, that the fetched instruction does not require an additional scratch register 218-220 beyond the initial allocation of scratch registers 218-220, then the process next moves to step 308. Step 308 illustrates GPR mapper 58 waiting for a signal from one of FXUs 94 and 100 and LSUs 96 and 98 indicating that the fetched instruction has been completed. The process then proceeds to step 314, which depicts GPR mapper 58 determining, on the basis of input from ITU 42, whether any unexecuted instructions requiring additional scratch registers 218-220 beyond the initial allocation of scratch registers 218-220 remain in the thread that includes the fetched instruction. If GPR mapper 58 determines, on the basis of input from ITU 42, that any unexecuted instructions requiring additional scratch registers 218-220 beyond the initial allocation of scratch registers 218-220 remain in the thread that includes the fetched instruction, then the process returns to step 308.

Returning to step 314, if GPR mapper 58 determines, on the basis of input from ITU 42, that no unexecuted instructions requiring additional scratch registers 218-220 beyond the initial allocation of scratch registers 218-220 remain in the thread that includes the fetched instruction, then the process next moves to step 316. Step 316 illustrates GPR mapper 58 deallocating the additional scratch register 220 and resetting flag 240, such that the selected physical register 220 returns to the pool 242 of rename registers 212-220. The process then returns to step 304.
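The flow of steps 302 through 316 can be approximated in software as shown below. This Python sketch is a toy model under several assumptions, not the mapper hardware: the class and method names are hypothetical, the pool is identified by assumed even reference numerals 212-220, the initial number of scratch registers is zero as in the preferred embodiment, and the model ignores rename registers that are busy holding ordinary rename state.

```python
class ScratchMapper:
    """Toy model of steps 302-316; names and structure are illustrative only."""

    def __init__(self, rename_pool):
        # Step 302: a pool of rename registers with zero initial scratch registers.
        # Each flag is True while the register is on loan as a scratch register.
        self.flags = {register: False for register in rename_pool}

    def on_fetch(self, needs_extra_scratch):
        # Step 306: does the fetched instruction need an additional scratch register?
        if not needs_extra_scratch:
            return None
        # Steps 310-312: reallocate a rename register and set its flag.
        for register, is_scratch in self.flags.items():
            if not is_scratch:
                self.flags[register] = True
                return register
        raise RuntimeError("no rename register available for use as scratch")

    def on_complete(self, register, pending_scratch_users):
        # Step 314: keep the register while unexecuted instructions still need it.
        # Step 316: otherwise reset the flag, returning the register to the pool.
        if register is not None and pending_scratch_users == 0:
            self.flags[register] = False

    def available_renames(self):
        return sum(1 for is_scratch in self.flags.values() if not is_scratch)


mapper = ScratchMapper([212, 214, 216, 218, 220])
borrowed = mapper.on_fetch(needs_extra_scratch=True)
while_borrowed = mapper.available_renames()
mapper.on_complete(borrowed, pending_scratch_users=1)   # still needed: nothing released
still_borrowed = mapper.available_renames()
mapper.on_complete(borrowed, pending_scratch_users=0)   # no longer needed: released
after_release = mapper.available_renames()
```

In this model the full pool is available for renaming whenever microcode is not using scratch storage, which is the benefit the process is intended to provide over a fixed scratch allocation.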

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, the present invention is not limited to a particular processor architecture, but is applicable to any processor architecture.

1. A processor with reallocation, said processor comprising: an instruction-sequencing unit; an execution unit; a set of physical registers including one or more architected registers and a pool of multiple rename registers; and a mapper that allocates from said pool of rename registers an initial number of scratch registers for storing microcode operands, wherein responsive to detecting that a fetched instruction is targeted to a scratch register requiring an additional scratch register beyond said initial number, said mapper reallocates a selected physical register from among said pool of rename registers as said additional scratch register and sets a flag to indicate said rename register is allocated as said additional scratch register, and wherein, responsive to determining that said additional scratch register is no longer needed, deallocating said additional scratch register and resetting said flag, such that said selected physical register returns to said pool of rename registers.
2. The processor of claim 1, wherein said mapper reallocates a selected physical register from among said rename registers as said additional scratch register at a dispatch of said fetched instruction.
3. The processor of claim 1, further comprising an instruction translation unit for detecting said additional scratch register is no longer needed.
4. The processor of claim 3, wherein said instruction translation unit that detects said additional scratch register is no longer needed further comprises means for determining that no unexecuted instructions remain in a thread containing said fetched instruction requiring said additional scratch register.
5. The processor of claim 1, further comprising a single-bit latch.
6. The processor of claim 1, further comprising an instruction translation unit that detects that said fetched instruction requires said additional scratch register beyond said initial number.
7. The processor of claim 1, further comprising means within said mapper for allocating from said pool of rename registers zero scratch registers for storing microcode operands.
8. A machine-readable medium having a plurality of instructions processable by a machine embodied therein, wherein said plurality of instructions, when processed by said machine, causes said machine to perform a method, comprising: initially allocating from a set of physical registers one or more architected registers and a pool of multiple rename registers; allocating from said pool of rename registers an initial number of scratch registers for storing microcode operands; in response to detecting that a fetched instruction is targeted to a scratch register requiring an additional scratch register beyond said initial number, reallocating a selected physical register from among said pool of rename registers as said additional scratch register; setting a flag to indicate said rename register is allocated as said additional scratch register; and in response to determining that said additional scratch register is no longer needed, deallocating said additional scratch register and resetting said flag, such that said selected physical register returns to said pool of rename registers.
9. The machine-readable medium of claim 8, wherein said step of reallocating said selected physical register from among said rename registers as said additional scratch register further comprises reallocating a selected physical register from among said rename registers as said additional scratch register at a dispatch of said fetched instruction.
10. The machine-readable medium of claim 8, further wherein said method further comprises determining that said additional scratch register is no longer needed.
11. The machine-readable medium of claim 8, wherein said step of determining that said additional scratch register is no longer needed further comprises determining that no unexecuted instructions remain in a thread containing said fetched instruction requiring said additional scratch register.
12. The machine-readable medium of claim 8, wherein said step of setting said flag further comprises setting a single-bit latch.
13. The machine-readable medium of claim 8, wherein said method further comprises, detecting that said fetched instruction requires said additional scratch register beyond said initial number.